Introduction

I began my undergraduate career as an astronomy major, and after a few years I moved on to actuarial science. While I did not finish the astronomy program, my passion for the subject has only grown since moving on. One of the great resources I came across while studying astronomy is the Sloan Digital Sky Survey (SDSS), a massive repository of imaging and spectroscopic data on stars, galaxies, and quasars. In an earlier research project, I used SDSS data to construct my own stellar classification system, and I cross-referenced my diagrams with the widely accepted Hertzsprung-Russell diagram, with reasonable results.

Temperature-Magnitude Diagram

Universally Accepted Hertzsprung-Russell Diagram

HR-Diagram

Improving the Classification System:

I previously used logistic regression to predict the class of different stars. I would like to revisit that work and see whether xgboost can classify the stars in my dataset. The dataset has 240 observations, and I will simply split the data and load it into xgboost for analysis. I am curious to see how the classification performs with multiple numeric columns. xgboost is great at taking sparse matrices and detecting patterns. Fortunately, there are not too many spectral classes, because they need to be encoded as numeric values for xgboost to work properly.

In this sample there is only one G-class star, so the model will have trouble if that star lands in the test set, since it will never have seen a G-class example during training. I will drop the “Type” column because its information is more or less covered by temperature: “Type” is a number from 0 to 9 denoting the temperature scale of the stars within each class (0 is the hottest, 9 is the coolest).

Another problem xgboost runs into with classification: if you split the train and test sets before creating the sparse matrix, you will likely find that the two matrices no longer contain the same predictor columns, because a factor level that appears in only one split generates a dummy column in only that matrix. The way around this is to split the data AFTER constructing the sparse matrix.
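A toy sketch of this pitfall (made-up data, not the star set): when each split is encoded on its own, a factor level absent from one split silently drops a column, while subsetting rows of the full matrix keeps the columns identical.

```r
library(Matrix)

# Toy data: the "White" level appears in only one half of the rows.
df <- data.frame(
  label = c(0, 1, 0, 1),
  color = factor(c("Red", "Blue", "Red", "White"))
)

# Build the model matrix on the FULL data first...
full_mat <- sparse.model.matrix(label ~ . - 1, data = df)

# ...then split by row index, so train and test share identical columns.
idx <- c(1, 2)
trainm <- full_mat[idx, ]
testm  <- full_mat[-idx, ]
identical(colnames(trainm), colnames(testm))  # TRUE

# Encoding a split on its own loses the "White" dummy column entirely.
trainm_bad <- sparse.model.matrix(label ~ . - 1, data = droplevels(df[idx, ]))
ncol(trainm_bad) < ncol(full_mat)  # TRUE
```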

##           Actual
## Prediction  0  1  2  3  4  5
##          0 29  0  0  0  0  0
##          1  0 13  0  0  0  0
##          2  0  3  4  1  0  0
##          3  0  0  0  3  0  0
##          4  0  1  0  0 16  0
##          5  0  0  0  0  0  2
## 
## FALSE  TRUE 
##     5    67

Reaction:

To show how easy it can be, I intentionally did not tune the hyperparameters, so I am sure this model can be improved even further. With almost no tuning, I was able to correctly classify 67 out of 72 test stars using five predictors. xgboost is an incredible tool that should not be overlooked.
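As one possible next step (a sketch, not part of the original analysis), xgboost's built-in cross-validation, xgb.cv, can pick a sensible number of boosting rounds via early stopping before any deeper tuning of eta or max_depth. Synthetic data stands in here for the star features; with the real data you would pass the training matrix from the appendix instead.

```r
library(xgboost)

set.seed(34)
# Synthetic stand-in: 300 "stars" with 4 numeric features and 3 classes.
X <- matrix(rnorm(300 * 4), ncol = 4)
y <- sample(0:2, 300, replace = TRUE)
dtrain <- xgb.DMatrix(data = X, label = y)

params <- list(objective = "multi:softmax",
               eval_metric = "mlogloss",
               num_class = 3,
               eta = 0.1)

# 5-fold CV with early stopping: training halts once the held-out
# mlogloss has not improved for 10 consecutive rounds.
cv <- xgb.cv(params = params, data = dtrain, nrounds = 100,
             nfold = 5, early_stopping_rounds = 10, verbose = 0)
cv$best_iteration  # round with the lowest cross-validated mlogloss
```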

Mapping the Universe

I also want to include a quick plot I created using 3D coordinates, where the x, y, and z values were derived from right ascension (RA), declination (Dec), and a radial distance r. For this plot, r was the observed redshift.

\(x = r \cos(RA) \cos(Dec)\)

\(y = r \sin(RA) \cos(Dec)\)

\(z = r \sin(Dec)\)

3-D Plot With r = Observed Redshift
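These formulas translate to a few lines of R. A minimal sketch with made-up values (the column names are illustrative, not the actual SDSS export); note that RA and Dec arrive in degrees and must be converted to radians before taking sines and cosines:

```r
# Convert degrees to radians before applying trig functions.
deg2rad <- function(d) d * pi / 180

# Illustrative values; real RA/Dec/redshift come from the SDSS query.
sky <- data.frame(ra = c(10, 185), dec = c(-5, 40), redshift = c(0.02, 0.10))

sky$x <- sky$redshift * cos(deg2rad(sky$ra)) * cos(deg2rad(sky$dec))
sky$y <- sky$redshift * sin(deg2rad(sky$ra)) * cos(deg2rad(sky$dec))
sky$z <- sky$redshift * sin(deg2rad(sky$dec))

# Sanity check: every point sits at distance `redshift` from the origin.
sqrt(sky$x^2 + sky$y^2 + sky$z^2)
```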

Appendix: xgboost Code

# Packages used throughout the appendix
library(dplyr)    # rename(), pipes
library(Matrix)   # sparse.model.matrix()
library(xgboost)
library(ggplot2)

link <- 'https://raw.githubusercontent.com/st3vejobs/607-Final-Project/main/nasa_HR_set.csv'
hr <- read.csv(url(link), na.strings = "")

hr <- hr %>%
  rename(label = Spectral_Class)

head(hr)

unique(hr$label)

labels <- c("M" = 0, "B" = 1, "A" = 2, "F" = 3, "O" = 4, "K" = 5, "G" = 6)

hr_archive <- hr
hr <- transform(hr, label = labels[as.character(label)])

# Drop the "Type" column; its information is largely captured by temperature.
hr <- hr[,-c(7)]

set.seed(34)
sample_rows <- sample(nrow(hr), nrow(hr) * 0.7)

dt <- sort(sample_rows)

test <- hr[-dt,]
train <- hr[dt,]

#XGBoost needs a matrix as input.

# It is CRITICAL that you make the sparse matrix prior to splitting the data in order for the algorithm to properly receive all the same columns in the test and train sets. 
full_mat <- sparse.model.matrix(label~.-1,data=hr)

trainm <- full_mat[dt,]

#trainm <- sparse.model.matrix(label ~ .-1,data = train)
train_label <- train[,"label"]

# Pass the sparse matrix directly; wrapping it in as.matrix() would densify it.
train_matrix <- xgb.DMatrix(data = trainm, label = train_label)

testm <- full_mat[-dt,]
#testm <- sparse.model.matrix(label~.-1,data = test)
test_label <- test[,"label"]
test_matrix <- xgb.DMatrix(data = testm, label = test_label)



nc <- length(unique(train_label))

# Note: the old "silent" parameter is deprecated; verbosity is controlled
# via the verbose argument to xgb.train() below.
xgb_params <- list("objective" = "multi:softmax",
                   "eval_metric" = "mlogloss",
                   "num_class" = nc,
                   "eta" = 0.01)

watchlist <- list(train = train_matrix, test = test_matrix)

# XGBoost Model

# Eta is 0.3 by default; it is lowered to 0.01 here so the learning curve
# is easier to read. The dataset is small, so 500 rounds is cheap to run,
# even though the log-loss plot below shows far fewer rounds would suffice.

xgb_model <- xgb.train(params = xgb_params, data = train_matrix, nrounds = 500, watchlist = watchlist, verbose = 0)

e <- data.frame(xgb_model$evaluation_log)

ggplot(e, aes(iter,train_mlogloss))+
  geom_point(color = 'green',shape = 0)+
  geom_line(data = e, aes(x=iter,y=test_mlogloss),color = 'red')+
  ggtitle("XGBoost Model Log Loss vs. Number of Iterations")+
  theme(plot.title = element_text(hjust = 0.5))
  
# The plot shows that the test log loss stops improving after about 10
# iterations, so roughly 10 rounds would be enough for this model.
#which(e$test_mlogloss == min(e$test_mlogloss))

imp <- xgb.importance(colnames(train_matrix),model = xgb_model)
xgb.plot.importance(imp)
# The importance plot ranks each predictor's contribution to the model.

p <- predict(xgb_model,newdata = test_matrix)

#pred <- matrix(p, nrow = nc, ncol = length(p)) %>%
#  t() %>%
#  data.frame() %>%
#  mutate(label = test_label,max_prob = max.col(.,"last") - 1)

pred <- data.frame(cbind(p, test_label))

pred$correct <- pred$p == pred$test_label

table(Prediction = pred$p,Actual = pred$test_label)

table(pred$correct)